Introduction

The Healthy Brain Network data set contains records from 2577 participants, each of whom completed a subset of 119 available assessments. The assessments range in both purpose and cost, and the goal of this analysis is to identify a set of assessments that is both inexpensive and reasonably predictive of learning disabilities. Value to the customer (both clinicians and patients) comes from avoiding expensive, time-consuming assessments when they aren’t required.

library(tidyverse)
library(plotly)
# library(raster)  # not attached in its entirety; called below via raster::

Data Files

This analysis starts with a clean copy of the most recent release, HBN Release 7.0.

setwd(dir = "C:/Users/Mike/Dropbox/PhD/hbn-analysis/")
all_files <- dir("./hbn-data-rel-7-20191018/")
# One file describing "DailyMeds" seems particularly 
# problematic due to multiple EID columns.
files_to_ignore <- c("DailyMeds")
for(file in files_to_ignore){
  all_files <- all_files[!str_detect(all_files, file)]
}
# Extract test name abbreviations.
test_names <- all_files %>% 
  str_replace("9994_", "") %>% 
  str_replace("_20191018.csv", "")

Study Participant Identification

Each assessment is stored in its own file, and each file lists only the participants who took that assessment. The common key across all assessment files is the “EID” column, so we can compile a master list of participants by collecting the unique EIDs across all assessments. We’ll first validate the number of unique EIDs across all assessment files.

EID_list <- NULL
file_count <- 0
for(file in all_files){
  file_count <- file_count + 1
  # print(str_c("file ", file_count, " of ", length(all_files), " part 1"))
  test_list <- read_csv(str_c("./hbn-data-rel-7-20191018/", file)) %>% 
    select(EID) %>% 
    pull()
  EID_list <- c(EID_list, test_list)
}
EID_list <- EID_list %>% unique() %>% sort()
# Keep only IDs matching the "ND..." participant EID pattern.
EID_list <- EID_list[grepl("^ND", EID_list)]
length(EID_list)
## [1] 2577
# 2577 unique EIDs.

Test Participation by Participant

We know there are 119 tests and 2577 unique participants, and we’d now like to get a feel for how many people have taken each test. Ultimately we want to find a subset of tests that a large portion of the study population have all completed. Eventually we could consider imputing small amounts of missing data, but let’s begin with a complete case approach. It is important to note that not all tests are appropriate for all participants (e.g., tailored to a specific sex or age group).

# Create the desired data frame structure.
EID_test_participation <- matrix(NA, nrow = length(EID_list), 
                                 ncol = length(all_files) + 1)
EID_test_participation <- as_tibble(EID_test_participation)
colnames(EID_test_participation) <- c("EID", test_names)
EID_test_participation$EID <- EID_list
# Populate the test participation data frame
file_count <- 0
for(file in all_files){
  file_count <- file_count + 1
  # print(str_c("file ", file_count, " of ", length(all_files), " part 2"))
  test_list <- read_csv(str_c("./hbn-data-rel-7-20191018/", file)) %>% 
    select(EID) %>% 
    pull()
  EID_test_participation[, file_count + 1] <- 
    EID_test_participation$EID %in% test_list
}

test_participation <- tibble(test = test_names,
                             num_participants = 
                               apply(EID_test_participation[,-1], 2, sum)) %>% 
  arrange(-num_participants) %>% 
  mutate(test = fct_reorder(test, num_participants, .desc = TRUE))

Plot observed counts to display test participation (least to greatest).

plot1 <- ggplot(test_participation) + 
  geom_point(aes(x = test, y = num_participants)) +
  labs(title = "Number of Participants for Each HBN Assessment",
       x = "Assessment Title",
       y = "Number of Participants") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  coord_flip()
ggplotly(plot1)

Pairwise Co-Participation in Assessments

Let’s examine how many assessments each pair of participants have in common. This may give us a general idea of how many of the 119 assessments we might be able to use for a certain-sized subset of the study population.

EID_test_matrix <- EID_test_participation %>% 
  select(-EID) %>% 
  as.matrix()
# Sort the matrix rows from lowest to highest row sums.
EID_test_sum <- apply(EID_test_matrix, 1, sum) %>% order()
EID_test_matrix <- EID_test_matrix[EID_test_sum, ]
# Cross-product: entry (i, j) counts the tests participants i and j both took.
EID_pairwise <- EID_test_matrix %*% t(EID_test_matrix)
# Reverse the columns.
EID_pairwise <- EID_pairwise[,length(EID_test_sum):1]
par(mar = c(0,0,4,0))
raster::plot(x = raster::raster(EID_pairwise), axes = FALSE, box = FALSE,
             xlab = "Participants", ylab = "Participants",
             main = "Pairwise Co-Participation in Tests\n(Participant Index vs. Reordered Participant Index;\nNumber of Tests)")

This figure should be somewhat concerning: only the bottom-left quadrant (roughly one quarter of the pairwise comparisons) shares ~50 or more of the 119 possible assessments at the subject-subject pair level. Worse, there is no guarantee that similarly colored pixels refer to the same set of shared tests; two pairs can share the same number of tests while sharing entirely different tests. As an alternate view of the same distribution, we’ll construct a histogram of the pairwise co-participation counts.

max_shared_assessments <- max(EID_pairwise)
par(mar = c(5.1, 4.1, 4.1, 2.1))
hist(EID_pairwise, breaks = 0:max_shared_assessments, 
     main = "Pairwise Co-Participation in Tests", 
     xlab = "Number of Shared Tests")

Again, a large majority of pairs share fewer than 45-50 assessments. A reasonable guess at this point is that we can find many hundreds of people who have all taken the same few dozen tests. There will no doubt be a tradeoff in which increasing the sample size forces tests to be dropped due to lack of sufficient participation.
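The caveat above, that equal co-participation counts need not mean equal shared test sets, can be made concrete with a toy example (hypothetical data; three participants across four tests):

```r
# Toy participation matrix: rows = participants, columns = tests.
m <- rbind(
  A = c(t1 = 1, t2 = 1, t3 = 1, t4 = 1),
  B = c(t1 = 1, t2 = 1, t3 = 0, t4 = 0),
  C = c(t1 = 0, t2 = 0, t3 = 1, t4 = 1)
)
# The same cross-product used in the analysis gives co-participation counts.
co <- m %*% t(m)
co["A", "B"]  # 2 shared tests
co["A", "C"]  # also 2 shared tests...
# ...but the shared sets are disjoint:
shared_AB <- colnames(m)[m["A", ] & m["B", ]]  # "t1" "t2"
shared_AC <- colnames(m)[m["A", ] & m["C", ]]  # "t3" "t4"
```

So two identically colored pixels in the co-participation image can correspond to entirely different sets of usable tests.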

Usable Assessments at Various Sample Sizes

We’d like to now find the set of assessments that we can use for various sample sizes. For now we’ll assume we want complete cases (i.e., no missing assessments for any participants). It is worth noting that this does not imply anything more than a subject’s EID number is on the roster for having taken the assessment. There could still be missing values or data entry errors.

# Sample Size Options
test_n <- seq(2500, 100, -100)
# How many people do we get?
n <- rep(0, length(test_n))
# How many assessments do we get?
p <- rep(0, length(test_n))
test_name_list <- list()

The heuristic below keeps every test whose overall participation exceeds the candidate sample size \(n\), then retains all people who completed every such test. For each candidate \(n\) we record three things: the achieved study size, the number of tests completed by everyone in the subset, and the names of those tests.

test_counts <- apply(EID_test_matrix, 2, sum)
for(test_i in 1:length(test_n)){
  this_test_n <- test_n[test_i]
  top_n_tests <- which(test_counts > this_test_n)
  EID_test_matrix_pruned <- EID_test_matrix[,top_n_tests]
  # Recode FALSE as NA so complete.cases() keeps only rows with every test.
  EID_test_matrix_pruned[!EID_test_matrix_pruned] <- NA
  EID_test_matrix_pruned <- EID_test_matrix_pruned[complete.cases(EID_test_matrix_pruned),]
  n[test_i] <- nrow(EID_test_matrix_pruned)
  p[test_i] <- ncol(EID_test_matrix_pruned)
  test_name_list[[test_i]] <- colnames(EID_test_matrix_pruned)
}

Let’s take a look at the relationship between \(n\) and \(p\) now that we have a few different-sized complete-case subsets.

ggplot() + 
  geom_point(aes(n, p)) + 
  labs(title = "Complete Tests vs. Study Size",
     y = "Number of Completed Tests", 
     x = "Study Population Completing All Tests")

Let’s consider the two scenarios in which 471 and 1592 participants completed 52 and 28 common assessments, respectively.

Scenario 1: 471 subjects, 52 assessments, assessment list

n[10] # Number of Subjects
## [1] 471
p[10] # Number of Common Assessments
## [1] 52
test_name_list[[10]] # All of these subjects completed all of these assessments.
##  [1] "APQ_SR"                     "ARI_P"                     
##  [3] "ARI_S"                      "ASSQ"                      
##  [5] "Barratt"                    "Basic_Demos"               
##  [7] "BIA"                        "C3SR"                      
##  [9] "CBCL"                       "CELF_5_Screen"             
## [11] "CIS_P"                      "ColorVision"               
## [13] "ConsensusDx"                "CTOPP_2"                   
## [15] "DTS"                        "EEG_TRACK"                 
## [17] "EHQ"                        "ESWAN"                     
## [19] "FitnessGram"                "FSQ"                       
## [21] "MFQ_P"                      "MRI_Track"                 
## [23] "NIH_Full"                   "NIH_Scores"                
## [25] "NIH_Scores_20191018_v2.csv" "NLES_P"                    
## [27] "PBQ"                        "PCIAT"                     
## [29] "Pegboard"                   "Physical"                  
## [31] "PPS"                        "PreInt_Demos_Fam"          
## [33] "PreInt_Demos_Home"          "PreInt_DevHx"              
## [35] "PreInt_EduHx"               "PreInt_FamHx_RDC"          
## [37] "PreInt_Lang"                "PreInt_TxHx"               
## [39] "PSI"                        "SAS"                       
## [41] "SCARED_P"                   "SCARED_SR"                 
## [43] "SCQ"                        "SDQ"                       
## [45] "SDS"                        "SRS"                       
## [47] "SWAN"                       "WHODAS_P"                  
## [49] "WIAT"                       "WISC"                      
## [51] "WISC_20191018_v2.csv"       "YFAS_C"

Scenario 2: 1592 subjects, 28 assessments, assessment list

n[5] # Number of Subjects
## [1] 1592
p[5] # Number of Common Assessments
## [1] 28
test_name_list[[5]] # All of these subjects completed all of these assessments.
##  [1] "APQ_SR"                     "ARI_P"                     
##  [3] "ARI_S"                      "ASSQ"                      
##  [5] "Barratt"                    "Basic_Demos"               
##  [7] "CBCL"                       "CELF_5_Screen"             
##  [9] "ColorVision"                "ConsensusDx"               
## [11] "CTOPP_2"                    "EEG_TRACK"                 
## [13] "EHQ"                        "FitnessGram"               
## [15] "NIH_Scores"                 "NIH_Scores_20191018_v2.csv"
## [17] "Pegboard"                   "Physical"                  
## [19] "PreInt_Demos_Fam"           "PreInt_Demos_Home"         
## [21] "PreInt_DevHx"               "PreInt_EduHx"              
## [23] "PreInt_TxHx"                "SCQ"                       
## [25] "SDQ"                        "SRS"                       
## [27] "SWAN"                       "WIAT"

Clearly there is a substantial reduction in sample size as the desired number of assessments increases. The challenge now will be to retain as many subjects and as many assessments as possible given a specific research question.

Forming a Hypothesis

We can take either of two approaches at this point: 1) throw in all the data and look for “something” or 2) look for something specific for which we think the data we have is relevant. Option 2 is more likely to produce a clinically relevant result. The next step should be to consider the sample sizes above (or others) and their respective assessment lists and develop an appropriate question given the available data.

Next Steps

From our initial meeting, a stated goal was to predict cognitive/language task performance using only the NIH Toolbox scores and other assessments outside the cognition/language category (excluding CBCL). From the image below, the targeted cognitive/language outcomes include:

  • Temporal Discounting
  • ACE
  • WISC-V
  • WAIS
  • WIAT
  • DAS
  • CELF-5
  • GFTA
  • CTOPP
  • TOWRE
  • EVT
  • PPVT

knitr::include_graphics("hbn-assessment-list.PNG")

Considering the list of assessments identified previously with 471 participants, the eligible cognitive/language assessment outcomes we can try to predict include:

  • WISC-V
  • WIAT
  • CELF-5
  • CTOPP

In the larger group of participants (n=1592), we’d have just over half as many eligible predictors to predict three of these assessments:

  • WIAT
  • CELF-5
  • CTOPP

In the image below, WISC and WIAT appear to require the most time (60-75 minutes and 45-60 minutes, respectively) and present the greatest opportunity for savings (~1 hour across visits 1 and 3), so proceeding with the 471-subject scenario and these two time-consuming targets is our current recommendation. We can also optimize over these targets (WISC and WIAT) to grow the predictor set of other assessments, either as large as possible (requiring no domain expertise) or around some intended predictors. This effectively changes the question from “Does any subset of X predict Y?” to “Do X1, X2, and X3 predict Y?” Choosing X1, X2, and X3 intelligently requires domain expertise; without another reason to keep them, we may undesirably discard one or more due to low participation.

knitr::include_graphics("hbn-time-commitment.PNG")
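The forced-inclusion variant described above can be sketched as follows. The `complete_subset()` helper and the toy logical matrix (standing in for `EID_test_matrix`) are hypothetical; this is a minimal sketch of retaining chosen targets and predictors while pruning by participation, not production analysis code.

```r
# Sketch: keep a set of must-have tests (e.g., targets plus hand-picked
# predictors), then add any other tests with sufficient participation among
# the participants who completed all must-have tests.
complete_subset <- function(part_mat, must_keep, min_n) {
  # Participants who completed every must-keep test.
  rows <- rowSums(part_mat[, must_keep, drop = FALSE]) == length(must_keep)
  sub <- part_mat[rows, , drop = FALSE]
  # Among those participants, keep tests with at least min_n completions.
  cols <- union(must_keep, colnames(sub)[colSums(sub) >= min_n])
  sub <- sub[, cols, drop = FALSE]
  # Restrict to complete cases across the retained tests.
  sub[rowSums(sub) == ncol(sub), , drop = FALSE]
}

# Toy logical participation matrix (hypothetical stand-in).
set.seed(1)
toy <- matrix(runif(60) > 0.3, nrow = 10,
              dimnames = list(paste0("P", 1:10),
                              c("WISC", "WIAT", "SWAN", "SCQ", "SRS", "EHQ")))
res <- complete_subset(toy, must_keep = c("WISC", "WIAT"), min_n = 5)
dim(res)  # retained participants x retained tests
```

The same tradeoff applies: raising `min_n` grows the candidate test list only among well-attended tests, while forcing rarely taken predictors into `must_keep` can sharply shrink the retained sample.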